We intent to find out what makes a start-up attractive for acquisition, and what drives its acquisition price up or down.
In this study, we plan to analyze the acquisition data of more than 2000 companies from all over the world, which are founded from year 1900 to year 2014. Particulary, we have interests in exploring what factors impact the acquisition price of a company, and trying to predict the company’s acquisition price using regression. The data set has 2473 rows of data.
We consider the following variable as the response variable:
| variable name | description | data scoure | data type | variable type | # of missing records |
|---|---|---|---|---|---|
| aquisition_price_amount | Amount paid for aquisition | acquisitions.csv | int | numerical | 0 |
We consider the following 14 variables as the potential predictor:
| variable name | description | data scoure | data type | variable type | # of missing records |
|---|---|---|---|---|---|
| category_code | Entity category | objects.csv | string | categorical | 475 |
| normalized_name | Normalized entity name | objects.csv | string | categorical | 0 |
| logo_width | Logo width | objects.csv | float | numerical | 0 |
| logo_height | Logo height | objects.csv | float | numerical | 0 |
| description | Description of the entity | objects.csv | string | categorical | 1380 |
| country_code | Country code | objects.csv | string | categorical | 514 |
| state_code | State code | objects.csv | string | categorical | 980 |
| city | City name | objects.csv | string | categorical | 580 |
| region | Region name | objects.csv | string | categorical | 1110 |
| investment_rounds | Number of investment round participated in | objects.csv | int | numerical | 0 |
| investment_companies | Number of companies invested in | objects.csv | int | numerical | 0 |
| milestones | Number of milestones the entity has | objects.csv | int | numerical | 0 |
| relationships | Number of relationships the entity has | objects.csv | int | numerical | 0 |
| acquired_at | Date of deal | acquisition.csv | timestamp | numerical | 0 |
The database comes from a Kaggle public dataset about startup investments published by Justinas Cirtautas at the end of 2019. The dataset covers more 450 thousand company information from all over the world till 2013. Among those, about 10 thousand companies acomplished their acquisition successfully, while 80% of them got 0 dollars in their acquisition.
The entire dataset provides 11 cvs file with 154 columns, which covers six aspects of the startup ecosystem including organizations, individuals, company news, funding rounds, acquisitions, and IPOs. We populated all the 11 csv files. More detials can be found in the appendix at the end of this proposal. Considering the missing data rate, we determine to focus on only two files: acquisition.csv and objects.csv.
Venture capital investments hits 9.5 billion U.S. dollars in the internet industry in the United States, as of first quarter 2020. Other leading VC sectors in terms of investment were healthcare, and software. Inspired by the florish startup world, we want to investigate what factors can impact the company price during acquisition.
In this study, we select the price of the company for acquisition as our ‘Y’, the response variable. Our target is to find a set of predictors and use regression method to predict our Y. We checked all dataset and select 14 varaibles as our potential predictors. Later on we will do feature engineering to determine our X’s in the regression.
Meanwhile, we also want to know some interesting questions, like, whether the company name and the logo size has any impact on the acquisition price, or how the acquisition distributed by geography.
RWe merged two csv file to get our final database startups.csv. For the details in data cleaning, please refer to the Appendix at the end of the proposal.
Below is the snapshot to show we load our data into R successfully. We are including a subset of the columns for simplicity, because some columns have very long text entries.
sample_data = read.csv("startups.csv")
knitr::kable(head(sample_data[, c(
"aquisition_price_amount",
"city",
"funding_total_usd",
"category_code",
"funding_rounds",
"milestones",
"relationships"
)]))
| aquisition_price_amount | city | funding_total_usd | category_code | funding_rounds | milestones | relationships |
|---|---|---|---|---|---|---|
| 20000000 | Culver City | 0 | games_video | 0 | 0 | 6 |
| 47500000 | Mountain View | 5000000 | web | 1 | 3 | 14 |
| 15000000 | San Francisco | 3000000 | games_video | 1 | 2 | 6 |
| 262500000 | Shavano Park | 274999 | hardware | 1 | 0 | 4 |
| 36500000 | Nashville | 15286415 | photo_video | 3 | 1 | 2 |
| 100000000 | San Mateo | 28000000 | hardware | 1 | 3 | 7 |
# read the dataset
data0 = read.csv('startups1.csv', header = TRUE)
First, we checked the distribution of aquisition price.
hist(data0$aquisition_price_amount, breaks = 100)
As we can see from the plot above, the distrbution is highly right skewed. Then we use the box plot to check if there is any outliers.
boxplot(data0$aquisition_price_amount)
As we can see there is one record in the dataset with extremely high acquicition price. We checked the database further and found its acqusition price is 2600 billion, which equals the valuation of google and amazon. So we think this one should be an database error, we will exclude this database in our further analysis.
data1 = data0[data0$aquisition_price_amount < 10^12,]
boxplot(data1$aquisition_price_amount)
After excluding the error in our database, the distribution of aquisition price is still highly right skewed. We will consider the log transformation on the aquisition price.
data1["logprice"] = apply(data1["aquisition_price_amount"],2,log)
hist(data1$logprice, breaks = 100)
boxplot(data1$logprice)
# read the dataset
data0 = read.csv('startups1.csv', header = TRUE)
First, we checked the distribution of aquisition price.
hist(data0$aquisition_price_amount, breaks = 100)
As we can see from the plot above, the distrbution is highly right skewed. Then we use the box plot to check if there is any outliers.
boxplot(data0$aquisition_price_amount)
As we can see there is one record in the dataset with extremely high acquicition price. We checked the database further and found its acqusition price is 2600 billion, which equals the valuation of google and amazon. So we think this one should be an database error, we will exclude this database in our further analysis.
data1 = data0[data0$aquisition_price_amount < 10^12,]
boxplot(data1$aquisition_price_amount)
After excluding the error in our database, the distribution of aquisition price is still highly right skewed. We will consider the log transformation on the aquisition price.
data1["logprice"] = apply(data1["aquisition_price_amount"],2,log)
hist(data1$logprice, breaks = 100)
boxplot(data1$logprice)
The distributino of log aquisition price is relatively sysmetric. From the boxplot we can still find there are many records is outside of the 1.75 IQR. But we think we can start from here to build our model.
Then we checked the precdictor in the database one by one. For category_code, this is a categorical varaible.
summary(data1$category_code)
## advertising analytics automotive biotech
## 84 3 3 263
## cleantech consulting ecommerce education
## 39 32 50 6
## enterprise fashion finance games_video
## 115 3 20 88
## hardware health hospitality local
## 78 10 7 1
## manufacturing medical messaging mobile
## 21 4 5 117
## music nanotech network_hosting news
## 3 1 64 9
## other photo_video public_relations real_estate
## 84 5 74 6
## search security semiconductor social
## 19 36 71 4
## software sports transportation travel
## 403 2 3 4
<<<<<<< HEAD
## web NA's
## 260 475
plot(data1$category_code, data1$logprice)
plot(data1$category_code, data1$logprice)
The records concentrated in several major levels of category_code. As for the boxplot between category_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.
Then we turn to the name of the company. We extract the length of the company name as a feature.
data1["name_length"] = apply(data1["normalized_name"],2,nchar)
plot(data1$name_length, data1$logprice)
<<<<<<< HEAD
After check the relationship between length of the company name, we found if the company name has more than 30 characters, the acquisition price tend to decrease. so we plan to create a dummy variable, company name lenght > 30 or not, and add it to our candidate predictor list.
Then we check the logo width.
plot(data1$logo_width, data1$logprice)
fit = lm(logprice ~ logo_width, data1)
abline(fit, col = "red")
<<<<<<< HEAD
For the plot we found in fact, quite amount of the logo_width is 0. Then we guess, maybe in this column, 0 means missing value. so we decide to remove that part and replot the above figure.
For the plot we found in fact, quite amount of the logo_width is 0. Then we guess, maybe in this column, 0 means missing value. so we decide to remove that part and replot the above figure.
sum(data1$logo_width == 0)
## [1] 502
temp = data1[data1$logo_width > 0, ]
plot(temp$logo_width, temp$logprice)
fit = lm(logprice ~ logo_width, temp)
abline(fit, col = "red")
<<<<<<< HEAD
It turns out there are 502 records in total has 0 logo_width. After removing them, the relationship between logo_width and log price become not that significant. considering the maximum logo_width is more than 5000 times of the minimum width, the data reliability becomes skeptical.
We further checked the log_height.
plot(data1$logo_height, data1$logprice)
fit = lm(logprice ~ logo_height, data1)
abline(fit, col = "red")
<<<<<<< HEAD
sum(data1$logo_height == 0)
## [1] 502
temp = data1[data1$logo_height > 0, ]
plot(temp$logo_height, temp$logprice)
fit = lm(logprice ~ logo_height, temp)
abline(fit, col = "red")
<<<<<<< HEAD
It show the same pattern. Out of curiosity, we create a variable named, logo_ratio, which is the ration of logo height and width.
mask = data1$logo_height > 0
data1["logo_ratio"] = data1["logo_height"] / data1["logo_width"]
plot(data1$logo_ratio[mask], data1$logprice[mask])
fit = lm(logprice ~ logo_ratio, data1[mask,])
abline(fit, col = "red")
<<<<<<< HEAD
But it turns out seems this variable does not help much. Considering the data sanity issue in those columns, we did not put them to our candidate predictor list.
Then we checked the country_code, which is a categorical variable.
summary(data1$country_code)
## ANT ARE ARG AUS AUT BEL BMU BRA CAN CHE CHL CHN CYM CZE DEU DNK
## 1 1 2 23 3 3 1 4 53 8 1 23 1 1 23 5
## ESP FIN FRA GBR GHA IDN IND IRL ISR ITA JOR JPN KEN KOR LBN LUX
## 5 4 20 100 1 1 10 8 34 6 1 11 1 7 1 2
## MAR MDA MEX MMR MYS NCL NLD NOR NZL PAK PRT RUS SAU SGP SWE TUN
## 1 1 3 1 1 1 13 4 1 2 1 4 1 4 14 1
<<<<<<< HEAD
## TUR TWN USA VGB ZAF NA's
## 1 1 1532 1 5 514
plot(data1$country_code, data1$logprice)
plot(data1$country_code, data1$logprice)
The records concentrated in several major levels of country_code. To be more specific, most records is in USA. As for the boxplot between country_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.
The same observation found in city.
summary(data1$city)
## San Francisco New York Mountain View Santa Clara
## 99 84 44 44
## London Sunnyvale Palo Alto San Jose
## 42 36 31 29
## Seattle Cambridge San Diego Austin
## 29 28 27 24
## Chicago Los Angeles Redwood City Boston
## 22 21 20 19
## Irvine Waltham San Mateo Atlanta
## 17 16 15 14
## Bellevue Dublin Houston Cupertino
## 13 12 12 11
## Dallas Fremont Boulder Denver
## 11 11 10 10
## Marlborough Menlo Park Tel Aviv Paris
## 9 9 9 8
## Raleigh Santa Monica Tokyo Bedford
## 8 8 8 7
## Durham Pittsburgh Reston South San Francisco
## 7 7 7 7
## Amsterdam Arlington Brisbane Carlsbad
## 6 6 6 6
## Charlotte Englewood Hayward Herndon
## 6 6 6 6
## Milpitas Philadelphia Phoenix Portland
## 6 6 6 6
## Seoul Toronto Vienna Aliso Viejo
## 6 6 6 5
## Baltimore Columbia Lowell Morrisville
## 5 5 5 5
## Overland Park Pleasanton Scottsdale Stockholm
## 5 5 5 5
## Alpharetta American Fork Beijing Bothell
## 4 4 4 4
## Bristol Brooklyn Campbell Corte Madera
## 4 4 4 4
## El Segundo Emeryville Gaithersburg Indianapolis
## 4 4 4 4
## Irving Malvern McLean Melbourne
## 4 4 4 4
## Milwaukee Minneapolis Moscow New York City
## 4 4 4 4
## Newton Oxford Plano Princeton
## 4 4 4 4
## Richardson San Antonio San Bruno San Carlos
## 4 4 4 4
## Sydney Tel-Aviv Toronto, ON Westlake Village
## 4 4 4 4
<<<<<<< HEAD
## Woburn Ann Arbor (Other) NA's
## 4 3 777 580
plot(data1$city, data1$logprice)
The records concentrated in several major levels of city. As for the boxplot between city_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.
For investiment_rounds, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.
plot(data1$investment_rounds, data1$logprice)
For investiment_companies, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.
plot(data1$invested_companies, data1$logprice)
For milestones, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.
plot(data1$milestones, data1$logprice)
plot(data1$city, data1$logprice)
The records concentrated in several major levels of city. As for the boxplot between city_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.
For investiment_rounds, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.
plot(data1$investment_rounds, data1$logprice)
For investiment_companies, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.
plot(data1$invested_companies, data1$logprice)
For milestones, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.
plot(data1$milestones, data1$logprice)
For relationships, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.
plot(data1$relationships, data1$logprice)
fit = lm(logprice ~ relationships, data1)
abline(fit, col = "red")
<<<<<<< HEAD
### Correlation Analysis Then we checked the correlation matrix of the numerical variables.
data2 = data1[,c("name_length", "investment_rounds", "invested_companies", "milestones", "relationships")]
cor(data2)
## name_length investment_rounds invested_companies milestones
## name_length 1.00000000 -0.04786664 -0.04856472 -0.2870902
## investment_rounds -0.04786664 1.00000000 0.98970339 0.1519085
## invested_companies -0.04856472 0.98970339 1.00000000 0.1576869
## milestones -0.28709018 0.15190852 0.15768691 1.0000000
## relationships -0.14405538 0.39209720 0.40107629 0.4476997
## relationships
## name_length -0.1440554
=======
#### Correlation Analysis Then we checked the correlation matrix of the numerical variables.
data2 = data1[,c("name_length", "investment_rounds", "invested_companies", "milestones", "relationships")]
cor(data2)
## name_length investment_rounds invested_companies milestones
## name_length 1.00000000 -0.04775877 -0.04843915 -0.2870815
## investment_rounds -0.04775877 1.00000000 0.98970339 0.1519085
## invested_companies -0.04843915 0.98970339 1.00000000 0.1576869
## milestones -0.28708145 0.15190852 0.15768691 1.0000000
## relationships -0.14409414 0.39209720 0.40107629 0.4476997
## relationships
## name_length -0.1440941
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
## investment_rounds 0.3920972
## invested_companies 0.4010763
## milestones 0.4476997
## relationships 1.0000000
It turns out the investment_rounds and invested_companies shows strongly collinearity. We should only use one of them in the model to avoid the collinearity issue.
Based on the single predictor analysis above, it is now possible to further simplify the dataset that will be used to determine whether the aquisition_price of startups can be correlated with startup characteristics and their performance.
Some data adjustments include:
category_code, country_code, city, state_code. This will ommit potential dummy variables when dealing with factor variables with levels that provide little significance. The fewer levels and variables also adds benefit of simplifying the model.company_id,normalized_name, status, logo_width, logo_height. These are pure descriptions of the startups, and while these could have been a source for additional continuous (e.g. word counts) or discreet variables (mention of certain trendy keywords), we decided to leave out for now to narrow down the analysis on the other variables.startups_tr), with 80% of the observations, and the rest will be included in a test dataset (startups_te), useful for model cross-validation. The split is done randomly.#Loading of the data
library(readr)
startups = read_csv("./startups.csv")
attr(startups, "spec") = NULL
n = nrow(startups)
#Function that reduces the number of levels of a factor, by replacing some levels with the new.level based on the number of records a level has.
#If threshold < 1, then it keeps the largest levels by counts up to when a the threshold percentile is reached
#If threshold >= 1, then it keeps the levels that have counts greater or equal to the threshold.
shorten_levels = function(col, threshold, new.level){
new_col = as.character(col)
levels_to_collapse = 0
level_counts = summary(col)
counts = aggregate(new_col, by = list(col = new_col), FUN = length)
if(threshold < 1){
counts = counts[order(counts$x, decreasing = TRUE), ]
counts$cumpct = cumsum(counts$x)/sum(counts$x)
levels_to_collapse = counts[counts$cumpct >= threshold, "col"]
}else{
#levels_to_collapse = names(which(level_counts > threshold))
levels_to_collapse = counts[counts$x < threshold, "col"]
}
replace = new_col %in% levels_to_collapse
new_col[replace] = new.level
as.factor(new_col)
}
#Modified function from https://www.mjdenny.com/Text_Processing_In_R.html to "normalize text"
clean_string = function(string){
# Lowercase
temp = tolower(string)
# Remove everything that is not a number or letter (
temp = stringr::str_replace_all(temp,"[^a-zA-Z\\s]", " ")
# Shrink down to just one white space
temp = stringr::str_replace_all(temp,"[\\s]+", " ")
temp
}
#Collapse the number of levels for all factor predictors.
startups$category_code = shorten_levels(startups$category_code, .90, "other")
startups$country_code = shorten_levels(startups$country_code, 20, "other")
startups$city = shorten_levels(clean_string(as.character(startups$city)),20, "other")
startups$state_code = shorten_levels(as.character(startups$state_code),.90, "other")
summary(startups$state_code)
## CA CO FL IL MA MD NJ NY other PA
## 581 34 37 42 118 32 46 110 273 41
## TX unknown VA WA
## 77 980 43 59
#Remove unnecessary columns (particularly the descriptions and single-level factors and variables that did not seem relevant for the aquisition response variable)
to_remove = c("company_id","normalized_name", "status", "logo_width", "logo_height")
startups_desc = startups[, to_remove]
startups = startups[, -which(colnames(startups) %in% to_remove)]
#Remove outlier in acquisition price
startups = startups[startups$aquisition_price_amount<1e12,]
#Split of data for cross-validation: 80% for training, 20$ for testing.
set.seed(420)
train_index = sample(1:nrow(startups), size = floor(0.8 * nrow(startups)))
startups_tr = startups[train_index,]
startups_te = startups[-train_index,]
The final dataset for analysis includes the following predictors.
str(startups)
<<<<<<< HEAD
## Classes 'tbl_df', 'tbl' and 'data.frame': 2472 obs. of 16 variables:
=======
## tibble [2,472 x 16] (S3: tbl_df/tbl/data.frame)
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
## $ category_code : Factor w/ 13 levels "advertising",..: 4 13 4 5 8 5 3 13 1 8 ...
## $ country_code : Factor w/ 10 levels "AUS","CAN","CHN",..: 10 10 10 10 10 10 10 10 7 8 ...
## $ state_code : Factor w/ 14 levels "CA","CO","FL",..: 1 1 1 11 9 1 1 1 12 12 ...
## $ city : Factor w/ 17 levels "austin","cambridge",..: 8 6 12 8 8 8 12 6 8 8 ...
<<<<<<< HEAD
## $ investment_rounds : num 0 0 0 0 0 0 0 0 0 0 ...
## $ invested_companies : num 0 0 0 0 0 0 0 0 0 0 ...
## $ funding_rounds : num 0 1 1 1 3 1 5 2 0 0 ...
## $ funding_total_usd : num 0 5000000 3000000 274999 15286415 ...
## $ milestones : num 0 3 2 0 1 3 3 3 0 2 ...
## $ relationships : num 6 14 6 4 2 7 38 2 2 59 ...
## $ aquisition_price_amount: num 2.00e+07 4.75e+07 1.50e+07 2.62e+08 3.65e+07 ...
## $ top25pct : num 0 64 0 100 0 0 55 100 0 20 ...
## $ top50pct : num 0 18 0 0 0 0 0 0 0 7 ...
## $ istop25 : num 0 1 0 1 0 0 1 1 0 1 ...
## $ istop26_50 : num 0 1 0 0 0 0 0 0 0 1 ...
## $ name_length : num 7 10 7 8 9 9 6 8 9 8 ...
=======
## $ investment_rounds : num [1:2472] 0 0 0 0 0 0 0 0 0 0 ...
## $ invested_companies : num [1:2472] 0 0 0 0 0 0 0 0 0 0 ...
## $ funding_rounds : num [1:2472] 0 1 1 1 3 1 5 2 0 0 ...
## $ funding_total_usd : num [1:2472] 0 5000000 3000000 274999 15286415 ...
## $ milestones : num [1:2472] 0 3 2 0 1 3 3 3 0 2 ...
## $ relationships : num [1:2472] 6 14 6 4 2 7 38 2 2 59 ...
## $ aquisition_price_amount: num [1:2472] 2.00e+07 4.75e+07 1.50e+07 2.62e+08 3.65e+07 ...
## $ top25pct : num [1:2472] 0 64 0 100 0 0 55 100 0 20 ...
## $ top50pct : num [1:2472] 0 18 0 0 0 0 0 0 0 7 ...
## $ istop25 : num [1:2472] 0 1 0 1 0 0 1 1 0 1 ...
## $ istop26_50 : num [1:2472] 0 1 0 0 0 0 0 0 0 1 ...
## $ name_length : num [1:2472] 7 10 7 8 9 9 6 8 9 8 ...
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
In order to compare models, we use a single function that runs tests on both:
These diagnostics functions are further used in another set of utility functions that provide a side-by-side view of the model performance across the model assumptions, performance, model parameters and their caracteristics.
library(lmtest)
library(MASS)
library(faraway)
get_outliers = function(model){
stdresid = rstandard(model)
outliers = abs(stdresid) > 2
stats = c(
num_outliers = sum(outliers),
pct_outliers = mean(outliers)
)
list(stats = stats, outliers = outliers)
}
get_influentials = function(model){
cooks = cooks.distance(model)
influential = cooks > (4 / length(cooks))
stats = c(
num_influentials = sum(influential),
pct_influentials = mean(influential)
)
list(stats = stats, influentials = influential)
}
get_rmse = function(model, predict_function = NULL, test_actuals = NULL){
predictions = 0
if(!is.null(predict_function))
predict_function(model)
rsme_val = 0
if(!is.null(test_actuals))
rsme_val = sqrt(mean((test_actuals - predictions) ^ 2))
else
rsme_val = sqrt(mean(resid(model) ^ 2))
rsme_val
}
#get_rmse(model_transformation_boxcox_selected, predict_function = boxcox_reverse_function, test_actuals = startups_te$aquisition_price_amount)
# Diagnostic function that performs multiple hypothess tests for the model significance, whether the linear model assumptions are valid, and other model performance metrics
diagnostics_performance = function(model, pcol = "grey", lcol = "dodgerblue", plotit = TRUE, testit = TRUE, response_transform = NULL, train_actuals = NULL, as_data_frame=TRUE) {
#Plot functionality
if (plotit) {
par(mfrow = c(1, 2))
#Fitted vs residuals plot
plot(resid(model) ~ fitted(model),
col = pcol, pch = 20,
main = "Fitted vs Residuals Plot", xlab = "Fitted", ylab = "Residuals")
abline(h = 0, col = lcol, lwd = 2)
#Q-Q plot
qqnorm(resid(model), col = pcol, main = "Normal Q-Q Plot")
qqline(resid(model), lty = 2, lwd = 2, col = lcol)
}
#Test functionality
if (testit) {
resids = resid(model)
if(!is.null(response_transform) && !is.null(train_actuals)){
fits = response_transform(fitted(model))
resids = train_actuals-fits
}
shapiro_test_pval = shapiro.test(resids)$p.value
bp_test_pval = unname(bptest(model)$p.value)
rmse_loocv = sqrt(mean((resids / (1 - hatvalues(model))) ^ 2))
r_squared = summary(model)$r.squared
adj_r_squared = summary(model)$adj.r.squared
aic = extractAIC(model)[2]
bic = extractAIC(model, k = log(n))[2]
outliers = unname(get_outliers(model)$stats)
influentials = unname(get_influentials(model)$stats)
model_pval = unname(pf(summary(model)$fstatistic[1],summary(model)$fstatistic[2],summary(model)$fstatistic[3], lower.tail=FALSE))
rmse_train = sqrt(mean(resids ^ 2))
#Output from test
res = list(
"coefs" = nrow(summary(model)$coefficients),
"model.pval" = model_pval,
"r2" = r_squared,
"rmse.train" = rmse_train,
"rmse.test" = NA,
"bptest.pval" = bp_test_pval,
"shapiro.pval" = shapiro_test_pval,
"AIC" = aic,
"BIC" = bic,
"adjR2" = adj_r_squared,
"rsme_looocv" = rmse_loocv,
"outliers.num" = outliers[1],
"outliers.pct" = outliers[2],
"influential.num" = influentials[1],
"influential.pct" = influentials[2])
if(as_data_frame)
res = data.frame(unlist(res))
res
}
}
#Shows the variable infation factors and their variables for the parameters with high variability due to colinearity.
get_vifs = function(model){
vifs = faraway::vif(model)
vifs = data.frame(coeficient =names(vifs[vifs>5]), vif = unname(vifs[vifs>5]))
vifs = vifs[order(vifs$vif, decreasing=TRUE),]
vifs
}
#Compares two or more models side-by-side in a single dataframe, all model assumptions and prformance.
compare_model_performance = function(models, predict_transforms = NULL, test_data, test_actuals_col, train_actuals = NULL, as_data_frame = FALSE){
dummy = unlist(diagnostics_performance(models[[1]], plotit = F))
model_comp = matrix(0, nrow = length(dummy), ncol = length(models))
col_names = rep(0, length(models))
#View(model_comp)
for(m in 1:length(models)){
model = models[[m]]
model_comp[,m] = as.numeric(unlist(diagnostics_performance(model, plotit = FALSE, response_transform = predict_transforms[[m]], train_actuals = train_actuals, as_data_frame = FALSE)))
transform_func = function(x){x}
transform_func = ifelse(is.null(predict_transforms), transform_func, predict_transforms[[m]])
predict_func = custom_predict(model, test_data = test_data, fitted_transform = transform_func)
actuals = as.numeric(unlist(test_data[,test_actuals_col]))
rmse_test = get_rmse(model, predict_function = predict_func, test_actuals = actuals)
print(rownames(model_comp))
print(model_comp)
model_comp[names(dummy) == "rmse.test", m] = rmse_test #
}
#print(col_names)
rownames(model_comp) = names(dummy)
model_comp
}
<<<<<<< HEAD
=======
#compare_without_influential(model_transformation_boxcox_selected, train_data = startups_tr, test_data = startups_te, test_actuals_col = "aquisition_price_amount", response_transform = boxcox_reverse_function(lambda = 0.12))
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
#Helper function to allow custom response transforamtions, relevant to compare models.
custom_predict = function(model, test_data, fitted_transform = NULL){
fitted.transform = ifelse(is.null(fitted_transform), function(x){x}, fitted_transform)
function(model){
fitted_transform(predict(model, test_data))
}
}
#View model diagnostics with and without influential points.
compare_without_influential = function(model, train_data, test_data, test_actuals_col, response_transform = NULL){
formula = as.character(model$call)[2]
influentials = get_influentials(model)$influentials
new_model = lm(as.formula(formula), data = train_data, subset = !influentials)
compare_coef = compare_model_performance(list(model, new_model), predict_transforms = list(response_transform, response_transform), test_data = test_data, test_actuals_col = test_actuals_col, train_actuals = as.numeric(unlist(train_data[,test_actuals_col])))
colnames(compare_coef) = c("W/ Influential", "W/o Influential")
compare_coef
}
#List all coefficients used by all models, along with a view of coefficient values
compare_parameters = function(models, variables, model_names = 0){
all_coefs = data.frame(coefficients = rownames(summary(models[[1]])$coefficients))
col_names = c("coefficients")
for(m in 1:length(models)){
model = models[[m]]
model_name = ifelse(model_names == 0, m, model_names[m])
new_items = data.frame(coefficients = rownames(summary(model)$coefficients))
for(v in 1:length(variables)){
new_items = cbind(new_items,
variable = summary(model)$coefficients[,variables[v]])
colnames(new_items)[v+1] = paste(model_name,"_",v)
}
all_coefs = merge(all_coefs, new_items, by = "coefficients", all = TRUE)
#colnames(all_coefs) = col_names
}
#colnames(all_coefs) = col_names
all_coefs
}
All variables
The first step of the exploration is to look at a simple additive model that contains all variables.
#Training the model
additive_full = lm(aquisition_price_amount ~ . , data = startups_tr)
#Diagnostics and performance of the model
knitr::kable(diagnostics_performance(additive_full), digits = 2)
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 6.200000e+01 |
| model.pval | 0.000000e+00 |
| r2 | 1.600000e-01 |
| rmse.train | 1.685398e+09 |
| rmse.test | NA |
| bptest.pval | 0.000000e+00 |
| shapiro.pval | 0.000000e+00 |
| AIC | 8.412779e+04 |
| BIC | 8.448821e+04 |
| adjR2 | 1.300000e-01 |
| rsme_looocv | 1.767236e+09 |
| outliers.num | 3.900000e+01 |
| outliers.pct | 2.000000e-02 |
| influential.num | 5.400000e+01 |
| influential.pct | 3.000000e-02 |
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(additive_full), digits = 2)
| coeficient | vif | |
|---|---|---|
| 14 | invested_companies | 49.28 |
| 13 | investment_rounds | 49.22 |
| 7 | country_codeUSA | 46.53 |
| 10 | cityother | 45.72 |
| 12 | cityunknown | 44.11 |
| 6 | country_codeunknown | 28.09 |
| 8 | state_codeunknown | 25.73 |
| 9 | citynew york | 9.44 |
| 11 | citysan francisco | 8.59 |
| 5 | country_codeother | 7.07 |
| 3 | category_codeunknown | 6.84 |
| 4 | country_codeGBR | 5.66 |
| 2 | category_codesoftware | 5.36 |
| 1 | category_codeother | 5.10 |
We can see that model assumptions are clearly violated. We also see that there are variables with a potentially large variance inflation factor. For example, it would make sense to only use investment_rounds or invested_companies, or country or state, because there can be collinearity as discussed in the data exploration section. 1.97% and 2.73% of points are considered to be outliers and to have influence respectevely. Although we can aim to reduce these, in this first exploration of different additive models it does not seem likely this values will change much.
#Direct correlation between investment_rounds and invested_companies is close to one
cor(startups_tr$investment_rounds,startups_tr$invested_companies)
## [1] 0.9889835
#The partial correlation coefficient is also very close to zero.
#The variation of invested_companies that is unexplained by remaining predictors shows little correlation
#with the variation of aquisition_price_amount that is not explained remaining predictors.
#Adding invested companies is of little use to the model.
model_test_invested_companies_1 = lm(invested_companies ~ .-aquisition_price_amount , data = startups_tr)
model_test_invested_companies_2 = lm(aquisition_price_amount ~ .-invested_companies , data = startups_tr)
cor(resid(model_test_invested_companies_1), resid(model_test_invested_companies_2))
## [1] 0.1080731
We remove the invested_companies and state_code from the predictors and train a new model.
Improving collinearity
#Training the model
additive_full_updated = lm(aquisition_price_amount ~ .-investment_rounds -state_code , data = startups_tr)
#Diagnostics and performance of the model
knitr::kable(diagnostics_performance(additive_full_updated), digits = 2)
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 4.800000e+01 |
| model.pval | 0.000000e+00 |
| r2 | 1.300000e-01 |
| rmse.train | 1.709368e+09 |
| rmse.test | NA |
| bptest.pval | 0.000000e+00 |
| shapiro.pval | 0.000000e+00 |
| AIC | 8.415563e+04 |
| BIC | 8.443466e+04 |
| adjR2 | 1.100000e-01 |
| rsme_looocv | 1.764967e+09 |
| outliers.num | 3.500000e+01 |
| outliers.pct | 2.000000e-02 |
| influential.num | 4.400000e+01 |
| influential.pct | 2.000000e-02 |
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(additive_full_updated), digits = 2)
| coeficient | vif | |
|---|---|---|
| 9 | cityother | 32.01 |
| 11 | cityunknown | 30.18 |
| 7 | country_codeUSA | 25.38 |
| 6 | country_codeunknown | 24.98 |
| 5 | country_codeother | 7.07 |
| 3 | category_codeunknown | 6.81 |
| 10 | citysan francisco | 5.95 |
| 4 | country_codeGBR | 5.65 |
| 2 | category_codesoftware | 5.35 |
| 1 | category_codeother | 5.08 |
| 8 | citynew york | 5.03 |
In the plots, we can see that normality and constant variance are still suspect. We also see confirm that normality suspect due to the low shapiro-wilk test p-value, and that the constant variance is suspect due to low bp test p-value. The R squared and adjusted R squared remain somewhat similar, as well as the RMSE leave one out cross validation. But we now have less variables with a potentially large collinearity. Outlier and influential points remain somewhat similar.
Backwards AIC
Now we will perform a backwards AIC search to optimize for AIC and further decrease collinearity issues.
#Finding the model
model_additive_selected_AIC = step(additive_full_updated, trace = 0)
#Diagnostics and performance of the model
knitr::kable(diagnostics_performance(model_additive_selected_AIC), digits = 2)
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 1.800000e+01 |
| model.pval | 0.000000e+00 |
| r2 | 1.200000e-01 |
| rmse.train | 1.719315e+09 |
| rmse.test | NA |
| bptest.pval | 0.000000e+00 |
| shapiro.pval | 0.000000e+00 |
| AIC | 8.411857e+04 |
| BIC | 8.422321e+04 |
| adjR2 | 1.200000e-01 |
| rsme_looocv | 1.757006e+09 |
| outliers.num | 3.900000e+01 |
| outliers.pct | 2.000000e-02 |
| influential.num | 5.000000e+01 |
| influential.pct | 3.000000e-02 |
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(model_additive_selected_AIC), digits = 2)
| coeficient | vif | |
|---|---|---|
| 2 | category_codeunknown | 5.96 |
| 1 | category_codesoftware | 5.16 |
The found model still violates normality and constant variance assumptions, as would normally be expected since this is derived from a model that also violated them. However, we are able to see less predictors with a variance inflation factor higher than 5. Outlier and influential points remain similar. Now we proceed to compare both models with an anova test and decide which is the best additive model.
Selecting the best additive model
anova(model_additive_selected_AIC, additive_full_updated)
## Analysis of Variance Table
##
## Model 1: aquisition_price_amount ~ category_code + invested_companies +
## funding_rounds + relationships + top50pct + istop26_50
## Model 2: aquisition_price_amount ~ (category_code + country_code + state_code +
## city + investment_rounds + invested_companies + funding_rounds +
## funding_total_usd + milestones + relationships + top25pct +
## top50pct + istop25 + istop26_50 + name_length) - investment_rounds -
## state_code
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1959 5.8441e+21
## 2 1929 5.7767e+21 30 6.7425e+19 0.7505 0.8334
We can see that the larger model doesn’t offer a statistically significant difference in explaining the relationship of predictors and response. For that reason, from the additive models, we prefer the model found through the backwards BIC method model_additive_selected_AIC. This model still violates normality and constant variance assumptions, and these will not likely be corrected or at least further improved until we look at more complex interaction models or we apply transformations to predictor or response variables (done below).
All in all, we won’t be focusing on this purely additive model model_additive_selected_AIC because of the very low r squared and potential low power to make predictions.
Two-way interactions
We started the exploration by training a 2-way interaction model, excluding investment_rounds and state_code from the variables, due to collinearity.
We also tried to fit an overall 3-way model in a couple of ways. First, by considering all terms besides investment_rounds and state_code, but due to computational limitations we decided to take a different approach (after 3 hours the model was not complete training yet). We also tried to fit an overall 3-way model, excluding city from the interaction terms (and adding it as a separate term on its own) to try to improve machine performance–altough this model offered a good initial R squared (~0.9) it was hard to interpret and still taxing on machine resources.
#Training the model
interaction_model = lm(aquisition_price_amount ~ (. -investment_rounds -state_code)^2 , data = startups_tr)
#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(interaction_model), digits = 2))
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 6.510000e+02 |
| model.pval | 0.000000e+00 |
| r2 | 7.390411e-01 |
| rmse.train | 9.380137e+08 |
| bptest.pval | 7.170000e-05 |
| shapiro.pval | 0.000000e+00 |
| AIC | 8.298877e+04 |
| BIC | 8.677316e+04 |
| adjR2 | 6.111201e-01 |
| rsme_looocv | Inf |
Although this model mathematically has a good R squared and adjusted R squared, the normality and constant variance assumptions are still violated. Furthermore, because there are 906 predictors, this is not an easy to understand model. We will find a simplified model doing a backwards AIC search.
Forward AIC
Because the previous model has 906 predictors, we use a forward search to optimize for computational resources. We tried to follow use a BIC approach so that we optimize for reducing the number of predictors, but the resulting R squared was quite low (0.10), so we decided to continue with the AIC method.
#Finding the model
model_interaction_start = lm(aquisition_price_amount ~ 1, data = startups_tr)
model_interaction_selected = step(
model_interaction_start,
scope = ~ category_code * funding_rounds + category_code * milestones +
category_code * aquisition_price_amount + category_code * top50pct + category_code *
istop26_50 + category_code * country_code + category_code * city + category_code *
invested_companies + category_code * funding_total_usd + category_code *
relationships + category_code * top25pct + category_code * istop25 + category_code *
name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
country_code + funding_rounds * city + funding_rounds * invested_companies +
funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
milestones * country_code + milestones * city + milestones * invested_companies +
milestones * funding_total_usd + milestones * relationships + milestones *
top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
country_code + aquisition_price_amount * city + aquisition_price_amount *
invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
top50pct * country_code + top50pct * city + top50pct * invested_companies +
top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
istop26_50 * name_length + country_code * city + country_code * invested_companies +
country_code * funding_total_usd + country_code * relationships + country_code *
top25pct + country_code * istop25 + country_code * name_length + city *
invested_companies + city * funding_total_usd + city * relationships + city *
top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
invested_companies * relationships + invested_companies * top25pct + invested_companies *
istop25 + invested_companies * name_length + funding_total_usd * relationships +
funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
name_length + relationships * top25pct + relationships * istop25 + relationships *
name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length,
direction = "forward",
trace = 0
)
#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_interaction_selected), digits = 2))
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 8.700000e+01 |
| model.pval | 0.000000e+00 |
| r2 | 5.475401e-01 |
| rmse.train | 1.235132e+09 |
| bptest.pval | 0.000000e+00 |
| shapiro.pval | 0.000000e+00 |
| AIC | 8.294879e+04 |
| BIC | 8.345454e+04 |
| adjR2 | 5.269520e-01 |
| rsme_looocv | Inf |
Although this model optimizes for AIC (it has a lower AIC than the first two-way interaction model), this model performs worse in other areas. In particular the both the R squared and adjusted R squared decreased (0.55 and 0.53 respectively). The normality and constant variance assumptions continue to be violated.
Selecting the best interaction model
Although we prefer the first two-way interaction model due to its better performance, we run an anova test to check the significance of the additional parameters included on the first two-way model vs the model found via forward AIC.
anova(model_interaction_selected, interaction_model)
## Analysis of Variance Table
##
## Model 1: aquisition_price_amount ~ relationships + category_code + funding_rounds +
## istop26_50 + top50pct + invested_companies + funding_total_usd +
## milestones + relationships:category_code + category_code:funding_rounds +
## funding_rounds:istop26_50 + relationships:top50pct + category_code:invested_companies +
## category_code:istop26_50 + category_code:top50pct + relationships:invested_companies +
## funding_rounds:invested_companies + top50pct:invested_companies +
## relationships:istop26_50 + top50pct:milestones
## Model 2: aquisition_price_amount ~ ((category_code + country_code + state_code +
## city + investment_rounds + invested_companies + funding_rounds +
## funding_total_usd + milestones + relationships + top25pct +
## top50pct + istop25 + istop26_50 + name_length) - investment_rounds -
## state_code)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1890 3.0160e+21
## 2 1326 1.7395e+21 564 1.2765e+21 1.7253 1.137e-15 ***
## ---
<<<<<<< HEAD
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
=======
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
The anova test confirms that the initial model does include predictors that are significant. The preferred interaction model is interaction_model.
Boxcox method
We start with using the same predictors used in the first 2-way interaction model. We use a \(\lambda\) value of 0.12. Also, we apply a logarithmic transformation in the predictors funding_total_usd and relationships (+1 is added to these predictors because there are records with the value 0 which would lead to issues when applying the log transformation), as this showed an improvement in decreasing adjusted r squared and AIC.
#Initial model, before transformation
model_transformation_boxcox_initial = lm(aquisition_price_amount ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)
#Boxcox plot
boxcox(model_transformation_boxcox_initial, plotit = TRUE, lambda = seq(0.1, 0.14, by = 0.01))
<<<<<<< HEAD
#Training the model
model_transformation_boxcox = lm((((aquisition_price_amount ^ 0.12) - 1) / 0.12) ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)
#Define the transformation functions to revert back to the units of aquisition_price
boxcox_reverse_function = function(lambda = 0.12){
function(val){
(val*lambda + 1)^(1/lambda)
}
}
#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_boxcox, response_transform = boxcox_reverse_function(lambda = 0.12), train_actuals = startups_tr$aquisition_price_amount)), digits = 2)
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 6.530000e+02 |
| model.pval | 0.000000e+00 |
| r2 | 6.100000e-01 |
| rmse.train | 1.106672e+09 |
| bptest.pval | 1.000000e+00 |
| shapiro.pval | 0.000000e+00 |
| AIC | 1.165783e+04 |
| BIC | 1.545384e+04 |
| adjR2 | 4.100000e-01 |
Although in this model we are seeing lower r squared and adjusted r squared values than we obtained with the two-way interaction model, now the constant variance assumption has improved (we can see this both in the fitted vs residuals plot, and in the result of the bp test, which provides a p-value of 0.9997102). Normality is still suspect.
Forward AIC, using boxcox model as the scope
We use the previously trained model as the full scope for a forward AIC search.
#Finding the model
model_transformation_boxcox_start = lm(((aquisition_price_amount ^ 0.12 - 1)/ 0.12) ~ 1, data = startups_tr)
model_transformation_boxcox_selected = step(
model_transformation_boxcox_start,
scope =
((aquisition_price_amount ^ 0.12 - 1)/ 0.12) ~ category_code * funding_rounds + category_code * milestones +
category_code * aquisition_price_amount + category_code * top50pct + category_code *
istop26_50 + category_code * country_code + category_code * city + category_code *
invested_companies + category_code * funding_total_usd + category_code *
relationships + category_code * top25pct + category_code * istop25 + category_code *
name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
country_code + funding_rounds * city + funding_rounds * invested_companies +
funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
milestones * country_code + milestones * city + milestones * invested_companies +
milestones * funding_total_usd + milestones * relationships + milestones *
top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
country_code + aquisition_price_amount * city + aquisition_price_amount *
invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
top50pct * country_code + top50pct * city + top50pct * invested_companies +
top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
istop26_50 * name_length + country_code * city + country_code * invested_companies +
country_code * funding_total_usd + country_code * relationships + country_code *
top25pct + country_code * istop25 + country_code * name_length + city *
invested_companies + city * funding_total_usd + city * relationships + city *
top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
invested_companies * relationships + invested_companies * top25pct + invested_companies *
istop25 + invested_companies * name_length + funding_total_usd * relationships +
funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
name_length + relationships * top25pct + relationships * istop25 + relationships *
name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length +
log(funding_total_usd + 1) + log(relationships + 1),
direction = "forward",
trace = 0
)
#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_boxcox_selected, response_transform = boxcox_reverse_function(lambda = 0.12), train_actuals = startups_tr$aquisition_price_amount)), digits = 2)
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 9.600000e+01 |
| model.pval | 0.000000e+00 |
| r2 | 6.500000e-01 |
| rmse.train | 1.376203e+10 |
| bptest.pval | 0.000000e+00 |
| shapiro.pval | 0.000000e+00 |
| AIC | 1.029215e+04 |
| BIC | 1.085022e+04 |
| adjR2 | 6.300000e-01 |
| rsme_looocv | 4.252724e+10 |
| outliers.num | 1.040000e+02 |
| outliers.pct | 5.000000e-02 |
| influential.num | 8.800000e+01 |
| influential.pct | 4.000000e-02 |
## Evaluate whether removing influential observations help on the model assumption and fit.
knitr::kable(compare_without_influential(model_transformation_boxcox_selected, train_data = startups_tr, test_data = startups_te, test_actuals_col = "aquisition_price_amount", response_transform = boxcox_reverse_function(lambda = 0.12)), digits = 2)
## NULL
## [,1] [,2]
## [1,] 9.600000e+01 0
## [2,] 0.000000e+00 0
## [3,] 6.523166e-01 0
## [4,] 1.376203e+10 0
## [5,] NA 0
## [6,] 8.525834e-60 0
## [7,] 5.991583e-72 0
## [8,] 1.029215e+04 0
## [9,] 1.085022e+04 0
## [10,] 6.347568e-01 0
## [11,] 4.252724e+10 0
## [12,] 1.040000e+02 0
## [13,] 5.260496e-02 0
## [14,] 8.800000e+01 0
## [15,] 4.451189e-02 0
## NULL
## [,1] [,2]
## [1,] 9.600000e+01 9.600000e+01
## [2,] 0.000000e+00 0.000000e+00
## [3,] 6.523166e-01 7.023774e-01
## [4,] 1.376203e+10 5.323183e+09
## [5,] NA NA
## [6,] 8.525834e-60 5.490834e-69
## [7,] 5.991583e-72 6.868054e-70
## [8,] 1.029215e+04 9.260603e+03
## [9,] 1.085022e+04 9.818669e+03
## [10,] 6.347568e-01 6.866082e-01
## [11,] 4.252724e+10 1.178727e+10
## [12,] 1.040000e+02 1.030000e+02
## [13,] 5.260496e-02 5.452620e-02
## [14,] 8.800000e+01 6.700000e+01
## [15,] 4.451189e-02 3.546850e-02
| W/ Influential | W/o Influential | |
|---|---|---|
| unlist.res.1 | 9.600000e+01 | 9.600000e+01 |
| unlist.res.2 | 0.000000e+00 | 0.000000e+00 |
| unlist.res.3 | 6.500000e-01 | 7.000000e-01 |
| unlist.res.4 | 1.376203e+10 | 5.323183e+09 |
| unlist.res.5 | NA | NA |
| unlist.res.6 | 0.000000e+00 | 0.000000e+00 |
| unlist.res.7 | 0.000000e+00 | 0.000000e+00 |
| unlist.res.8 | 1.029215e+04 | 9.260600e+03 |
| unlist.res.9 | 1.085022e+04 | 9.818670e+03 |
| unlist.res.10 | 6.300000e-01 | 6.900000e-01 |
| unlist.res.11 | 4.252724e+10 | 1.178727e+10 |
| unlist.res.12 | 1.040000e+02 | 1.030000e+02 |
| unlist.res.13 | 5.000000e-02 | 5.000000e-02 |
| unlist.res.14 | 8.800000e+01 | 6.700000e+01 |
| unlist.res.15 | 4.000000e-02 | 4.000000e-02 |
The model shows a significant improvement in performance (adjusted r squared, AIC, BIC) but the constant variance assumption is violated again. For this reason, we still prefer the original boxcox model model_transformation_boxcox.
Logarithmic transformation in response
We also explore using a logarithmic transformation in the response variable, using the same predictors originally used for the first model fitted with the boxcox method above.
#Training the model
model_transformation_log = lm(log(aquisition_price_amount) ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)
#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_log, response_transform = exp, train_actuals = startups_tr$aquisition_price_amount)), digits = 2)
<<<<<<< HEAD
| unlist.res. | |
|---|---|
| coefs | 6.530000e+02 |
| model.pval | 0.000000e+00 |
| r2 | 6.000000e-01 |
| rmse.train | 1.212812e+09 |
| bptest.pval | 1.000000e+00 |
| shapiro.pval | 0.000000e+00 |
| AIC | 3.930550e+03 |
| BIC | 7.726560e+03 |
| adjR2 | 4.100000e-01 |
| rsme_looocv | Inf |
In R squared and in predicted R squared this model performs marginally worse than the boxcox model, and the constant variance assumption doesn’t seem suspect from the bp test (although we can see it’s not fully met either from the fitted vs residuals plot). However, in AIC and in BIC, this model shows a more significant drop in performance (drop from 3274 to 3931 in AIC and from 7070 to 7727 in BIC). For that reason we prefer the boxcox model model_transformation_boxcox as a transformation-based model
Forward AIC, using lorarigthmic model as the scope
We also use the previously trained log model as the full scope for a forward AIC search.
#Finding the model
model_transformation_log_start = lm(log(aquisition_price_amount) ~ 1, data = startups_tr)
model_transformation_log_selected = step(
model_transformation_log_start,
scope =
log(aquisition_price_amount) ~ category_code * funding_rounds + category_code * milestones +
category_code * aquisition_price_amount + category_code * top50pct + category_code *
istop26_50 + category_code * country_code + category_code * city + category_code *
invested_companies + category_code * funding_total_usd + category_code *
relationships + category_code * top25pct + category_code * istop25 + category_code *
name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
country_code + funding_rounds * city + funding_rounds * invested_companies +
funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
milestones * country_code + milestones * city + milestones * invested_companies +
milestones * funding_total_usd + milestones * relationships + milestones *
top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
country_code + aquisition_price_amount * city + aquisition_price_amount *
invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
top50pct * country_code + top50pct * city + top50pct * invested_companies +
top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
istop26_50 * name_length + country_code * city + country_code * invested_companies +
country_code * funding_total_usd + country_code * relationships + country_code *
top25pct + country_code * istop25 + country_code * name_length + city *
invested_companies + city * funding_total_usd + city * relationships + city *
top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
invested_companies * relationships + invested_companies * top25pct + invested_companies *
istop25 + invested_companies * name_length + funding_total_usd * relationships +
funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
name_length + relationships * top25pct + relationships * istop25 + relationships *
name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length +
log(funding_total_usd + 1) + log(relationships + 1),
direction = "forward",
trace = 0
)
#Diagnostics and performance of the model
diagnostics_performance(model_transformation_log_selected)
<<<<<<< HEAD
## unlist.res.
## coefs 8.900000e+01
## model.pval 4.480644e-295
## r2 5.843293e-01
## rmse.train 1.989951e+00
## rmse.test NA
## bptest.pval 5.368126e-07
## shapiro.pval 1.174242e-31
## AIC 2.898787e+03
## BIC 3.416161e+03
## adjR2 5.649548e-01
## rsme_looocv 2.368853e+00
## outliers.num 1.020000e+02
## outliers.pct 5.159332e-02
## influential.num 7.300000e+01
## influential.pct 3.692463e-02
#Number of predictors
length(coef(model_transformation_log_selected))
## [1] 89
Although this model shows improvements in AIC and BIC, other metrics do not look as good. And the constant variance assumption is more suspect now. For that reason the original logarithmic model model_transformation_log seems better, and overall, we still prefer the original boxcox model model_transformation_boxcox as a transformation-based model.
In summary, the two preferred models are: interaction_model (interaction), model_transformation_boxcox (transformation). We are omitting model_additive_selected_AIC (additive) because of its low r squared.
#for interaction_model and model_additive_selected_AIC,no response transformation
<<<<<<< HEAD
simple_function = function(x) {
x
}
compare_model_performance(
models = list(interaction_model, model_additive_selected_AIC),
predict_transforms = list(simple_function, simple_function),
test_data = startups_te,
test_actuals_col = "aquisition_price_amount",
train_actuals = startups_tr$aquisition_price_amount
)
=======
simple_function = function(x){x}
compare_model_performance(
models = list(interaction_model, model_additive_selected_AIC),
predict_transforms = list(simple_function, simple_function),
test_data = startups_te, test_actuals_col = "aquisition_price_amount", train_actuals = startups_tr$aquisition_price_amount)
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
## Warning in predict.lm(model, test_data): prediction from a rank-deficient fit
## may be misleading
## NULL
## [,1] [,2]
## [1,] 6.510000e+02 0
## [2,] 2.889544e-160 0
## [3,] 7.390411e-01 0
## [4,] 9.380137e+08 0
## [5,] NA 0
## [6,] 7.169900e-05 0
## [7,] 8.984570e-55 0
## [8,] 8.298877e+04 0
## [9,] 8.677316e+04 0
## [10,] 6.111201e-01 0
## [11,] NaN 0
## [12,] NA 0
## [13,] NA 0
## [14,] NA 0
## [15,] NA 0
## NULL
## [,1] [,2]
## [1,] 6.510000e+02 1.800000e+01
## [2,] 2.889544e-160 3.474939e-45
## [3,] 7.390411e-01 1.232725e-01
## [4,] 9.380137e+08 1.719315e+09
## [5,] NA NA
## [6,] 7.169900e-05 3.098988e-19
## [7,] 8.984570e-55 8.583976e-65
## [8,] 8.298877e+04 8.411857e+04
## [9,] 8.677316e+04 8.422321e+04
## [10,] 6.111201e-01 1.156643e-01
## [11,] NaN 1.757006e+09
## [12,] NA 3.900000e+01
## [13,] NA 1.972686e-02
## [14,] NA 5.000000e+01
## [15,] NA 2.529084e-02
## [,1] [,2]
## unlist.res.1 6.510000e+02 1.800000e+01
## unlist.res.2 2.889544e-160 3.474939e-45
## unlist.res.3 7.390411e-01 1.232725e-01
## unlist.res.4 9.380137e+08 1.719315e+09
## unlist.res.5 NA NA
## unlist.res.6 7.169900e-05 3.098988e-19
## unlist.res.7 8.984570e-55 8.583976e-65
## unlist.res.8 8.298877e+04 8.411857e+04
## unlist.res.9 8.677316e+04 8.422321e+04
## unlist.res.10 6.111201e-01 1.156643e-01
## unlist.res.11 NaN 1.757006e+09
## unlist.res.12 NA 3.900000e+01
## unlist.res.13 NA 1.972686e-02
## unlist.res.14 NA 5.000000e+01
## unlist.res.15 NA 2.529084e-02
The startup data comes from Kaggle: [https://www.kaggle.com/justinas/startup-investments]
This dataset contains information about the startup ecosystem: organizations, individuals, company news, funding rounds, acquisitions, and IPOs, extracted originally from the Crunchabse Data website. In this project, only 6 of the 11 tables will be used and joined using unique IDs.
library(readr)
#People
people = read_csv("./people.csv")
colnames(people)[2] = "person_id"
attr(people, "spec") = NULL
colnames(people)
degrees = read_csv("./degrees.csv")
colnames(degrees)[2] = "person_id"
attr(degrees, "spec") = NULL
colnames(degrees)
#Company
companies = read_csv("./objects.csv")
attr(companies, "spec") = NULL
attr(companies, "problems") = NULL
colnames(companies)[1] = "company_id"
colnames(companies)
comp_peep = read_csv("./relationships.csv")
colnames(comp_peep)[3] = "person_id"
colnames(comp_peep)[4] = "company_id"
colnames(comp_peep)
attr(comp_peep, "spec") = NULL
#str(comp_peep)
aquisitions = read_csv("./acquisitions.csv")
colnames(aquisitions)[4] = "company_id"
colnames(aquisitions)[6] = "aquisition_price_amount"
attr(aquisitions, "spec") = NULL
colnames(aquisitions)
#Filter out companies with aquisition amount not in USD (all other variable/predictors are in USD)
aquisitions = aquisitions[aquisitions$price_currency_code == "USD",c("company_id", "aquisition_price_amount")]
After loading the individual tables, we need to join all into a single dataset that can be used for analysis.
The people, degrees and their relationship to the companies can all be joined into a sigle people dimension table. Additionally, we can extract potential relevant predictors for regression. For every person, generally an executive, we identify in each company people that attended to any of the top 25 or top 50 universities in the world (“top 50” excludes the top 25 universities). Also, we count the number of people in each company as well as the count of people by company that attended the top 25 and top 50 universities. Finally, to normalize these numbers, we also additional columns that represent percentage of people in each company that attended the top 25 and top 50 universities.
#Auxiliary functions to count rows for analysis of data
count_disticts = function(x){
length(unique(x))
}
counts = function(x){
length(x)
}
#Join people with their degrees information and with which company these people belong to.
full_peep = merge(people, degrees, by="person_id")
founders = merge(full_peep, comp_peep, by="person_id")
top1 = grep(".*University of Oxford.*|.*xford.*|.*California Institute of Technology.*|.*altech|.*University of Cambridge.*|.*ambridge.*|.*Stanford University.*|.*tanford.*|.*Massachusetts Institute of Technology.*|.*MIT.*|.*Princeton University.*|.*inceton.*|.*Harvard University.*| .*arvard#|.*Yale University.*|.*Yale.*|.*University of Chicago.*|.*Imperial College London.*|.*University of Pennsylvania.*|.*Johns Hopkins University.*|.*University of California Berkeley.*|.*Berkley.*|.*ETH Zurich.*|.*UCL.*|.*Columbia University.*|.*Columbia.*|.*University of California Los Angeles.*|.*University of Toronto.*|.*Cornell University.*|.*Cornell.*|.*Duke University.*|.*University of Michigan-Ann Arbor.*|.*Northwestern University.*|.*Tsinghua University.*|.*Peking University.*|.*National University of Singapore.*", ignore.case = TRUE, x = founders$institution)
top2 = grep(".*University of Washington.*|.*Carnegie Mellon University.*|.*Carnegie.*|.*London School of Economics and Political Science.*|.*LSE.*|.*New York University.*|.*University of Edinburgh.*|.*University of California San Diego.*|.*LMU Munich LMU.*|.*University of Melbourne.*|.*University of British Columbia.*|.*University of Hong Kong.*|.*King’s College London Kings College London.*|.*The University of Tokyo.*|.*École Polytechnique Fédérale de Lausanne.*|.*Lausanne.*|.*Georgia Institute of Technology.*|.*University of Texas at Austin.*|.*Karolinska Institute.*|.*McGill University.*|.*Technical University of Munich.*|.*Heidelberg University.*|.*KU Leuven.*|.*Paris Sciences et Lettres – PSL Research University Paris.*|.*PSL.*|.*The Hong Kong University of Science and Technology.*|.*University of Illinois at Urbana-Champaign.*|.*University of Illinois.*|.*Urbana-Champaign.*|.*Urbana.*|.*Champaign.*|.*Nanyang Technological University .*|.*Australian National University.*", ignore.case = TRUE, x = founders$institution)
top25 = rep(0, nrow(founders))
top50 = rep(0, nrow(founders))
top25[top1] = 1
top50[top2] = 1
founders["top25"] = top25
founders["top50"] = top50
founders = aggregate(x = founders[,c("top25", "top50")], by = list(company_id = founders$company_id, person_id = founders$person_id), FUN = max)
peeps = aggregate(x = founders[,c("company_id", "person_id")], by = list(company_id = founders$company_id), FUN = counts)
peeps = peeps[,c(1,2)]
colnames(peeps)[2] = "people"
founders = aggregate(x = founders[,c("top25", "top50")], by = list(company_id = founders$company_id), FUN = sum)
founders = merge(founders, peeps[,c(1,2)], by="company_id")
founders$top25pct = round(100*founders$top25/founders$people)
founders$top50pct = round(100*founders$top50/founders$people)
founders = na.omit(founders[,-c(2,3,4)])
After extracting the individual data aspects (people, company, investments) of the companies, these aspects are joined into a single dataset. The dataset that will be used for this project will depend on the resulting row numbers after different combination of joins.
company_aq = merge(companies, aquisitions, by = "company_id")
#nrow(company_aq)
company_aq_lds = merge(x = company_aq, y = founders, by = "company_id", all.x = TRUE)
#nrow(company_aq_lds)
To ensure that the dataset is viable for linear regression analysis, we create a preliminary linear model with company aquisition price as the response variable, and several other potential columns in the startups dataset as predictors. We are using the company_aq dataset above.
# Data set to be analized, filtering out aquisition values without aquisition price
startups = company_aq_lds[company_aq_lds$aquisition_price_amount>0, ]
nrow(startups)
col_remove = c("entity_type", "entity_id", "parent_id", "name", "permalink", "founded_at", "closed_at", "domain", "homepage_url", "twitter_username", "logo_url", "short_description", "description", "overview", "tag_list", "region", "first_investment_at", "last_investment_at", "first_funding_at", "last_funding_at", "first_milestone_at", "last_milestone_at", "created_by", "created_at", "updated_at")
startups = startups[,-which(colnames(startups) %in% col_remove)]
#Nulls by column
show_nas = function(somedf){
cols_nas = data.frame(column = colnames(somedf), numNAs = rep(0, ncol(somedf)))
for(i in 1:ncol(somedf)){
cols_nas[i, "numNAs"] = length(somedf[is.na(somedf[,i]),i])
}
cols_nas$pctNAs = round(100*cols_nas$numNAs / nrow(somedf), digits = 1)
cols_nas
}
show_nas(startups)
We clean NA values below:
#clean NAs; introduce "Unknown" for categorical variables where ther are N/As
startups[is.na(startups[,"category_code"]),"category_code"] = "unknown"
startups[is.na(startups[,"country_code"]),"country_code"] = "unknown"
startups[is.na(startups[,"city"]),"city"] = "unknown"
startups[is.na(startups[,"state_code"]),"state_code"] = "unknown"
#For numeric variablestop25pct, top50pct; will convert all of these into a categorical variable that says if people in company attended universities in top 25 universities in worl, then 1. Similarly, if company people (executives), attended universites in the top 26 to 50 universites `istop26_50`, 1; else, 0.
startups[is.na(startups[,"top25pct"]),"top25pct"] = 0
startups[is.na(startups[,"top50pct"]),"top50pct"] = 0
startups$istop25 = ifelse(startups$top25pct > 0, 1, 0)
startups$istop26_50 = ifelse(startups$top50pct > 0, 1, 0)
#Add a few additional predictors
startups$name_length = nchar(startups$normalized_name)
show_nas(startups)
With clean data, we can now perform the data split for training and data
#Generate the full data
write_csv(x = startups, path = "./startups.csv")
set.seed(420)
train_index = sample(1:nrow(startups), size = floor(0.8 * nrow(startups)))
startups_tr = startups[train_index,]
startups_te = startups[-train_index,]
#Save the train and test datasest
write_csv(x = startups_tr, path = "./startups_tr.csv")
write_csv(x = startups_te, path = "./startups_te.csv")
The dataset to be used, startups has 2472 rows of data.
A sample output of the data is below:
| aquisition_price_amount | city | funding_total_usd | category_code | funding_rounds | milestones | relationships |
|---|---|---|---|---|---|---|
| 20000000 | other | 0 | games_video | 0 | 0 | 6 |
| 47500000 | mountain view | 5000000 | web | 1 | 3 | 14 |
| 15000000 | san francisco | 3000000 | games_video | 1 | 2 | 6 |
| 262500000 | other | 274999 | hardware | 1 | 0 | 4 |
| 36500000 | other | 15286415 | other | 3 | 1 | 2 |
| 100000000 | other | 28000000 | hardware | 1 | 3 | 7 |
Moreover, a very simple model illustrates feasibility for linear regression analysis:
#Linear Model
company_model = lm(log(aquisition_price_amount) ~ log(funding_total_usd+1) + category_code + funding_rounds + milestones + relationships + istop25 + istop26_50, data=startups)
summary(company_model)
##
## Call:
## lm(formula = log(aquisition_price_amount) ~ log(funding_total_usd +
## 1) + category_code + funding_rounds + milestones + relationships +
## istop25 + istop26_50, data = startups)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3812 -1.1325 0.2108 1.6253 8.0076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.574302 0.300837 55.094 < 2e-16 ***
## log(funding_total_usd + 1) 0.002897 0.011861 0.244 0.807083
## category_codebiotech 1.422294 0.336273 4.230 2.43e-05 ***
## category_codeenterprise 0.240485 0.378627 0.635 0.525389
## category_codegames_video -0.010138 0.403346 -0.025 0.979949
## category_codehardware 0.628013 0.417439 1.504 0.132596
## category_codemobile 0.174453 0.378273 0.461 0.644708
## category_codenetwork_hosting 0.306121 0.438431 0.698 0.485107
## category_codeother 0.781903 0.319944 2.444 0.014601 *
## category_codepublic_relations 0.778358 0.422489 1.842 0.065549 .
## category_codesemiconductor 0.873731 0.428455 2.039 0.041531 *
## category_codesoftware 0.420082 0.318961 1.317 0.187951
## category_codeunknown -1.909558 0.320396 -5.960 2.89e-09 ***
## category_codeweb -0.207972 0.333049 -0.624 0.532391
## funding_rounds -0.108816 0.071172 -1.529 0.126414
## milestones 0.171153 0.059412 2.881 0.004001 **
## relationships 0.024988 0.004549 5.494 4.35e-08 ***
## istop25 1.202162 0.137580 8.738 < 2e-16 ***
## istop26_50 0.627238 0.168145 3.730 0.000196 ***
## ---
<<<<<<< HEAD
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
=======
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
##
## Residual standard error: 2.637 on 2453 degrees of freedom
## Multiple R-squared: 0.2632, Adjusted R-squared: 0.2578
## F-statistic: 48.69 on 18 and 2453 DF, p-value: < 2.2e-16
In addition, we can see how the linear regression assumptions of normality and constant variance are with the simple model. In the study, will better adjust tthe model in question to more appropriately describe the data with the model including its variance.
#Plots of the model
par(mfrow = c(1,2))
plot(company_model, which = 1:2)
<<<<<<< HEAD